
Created by:
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
from matplotlib import pyplot as plt
from plotnine import theme_dark, facet_grid, theme_classic, element_rect, element_line, geom_hline, geom_vline
from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap, xlab, scale_x_log10, theme_bw, theme, element_text, theme_dark
import plotly.express as px
import plotly.offline as py
import plotly.graph_objs as go
import plotly
import plotly.figure_factory as ff
plotly.offline.init_notebook_mode()
We start by reading the data and transforming it into something useful. We group the records by patient ID and use the mean of each measurement as the new 'main' value for that patient.
data = pd.read_excel('wuhan.xlsx',engine = 'openpyxl')
# Forward-fill the patient ID so every row carries the ID of the patient it belongs to
data["PATIENT_ID"] = data["PATIENT_ID"].ffill()
new_data = data.groupby("PATIENT_ID").mean()
copy_data = new_data
new_data
| PATIENT_ID | age | gender | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | ... | mean corpuscular hemoglobin | Activation of partial thromboplastin time | High sensitivity C-reactive protein | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 73 | 1 | 0 | 19.900000 | 133.200000 | 100.220000 | 13.466667 | 0.090 | 0.780000 | NaN | ... | 32.120000 | 38.400000 | 16.433333 | 0.09 | 140.540000 | 0.170000 | 41.0 | 29.200000 | 66.700000 | 99.000000 |
| 2.0 | 61 | 1 | 0 | 6.900000 | 145.750000 | 99.275000 | 13.300000 | 0.090 | 0.050000 | 796.5 | ... | 32.000000 | 39.150000 | 27.400000 | NaN | 138.225000 | 0.317500 | 40.0 | 29.000000 | 90.400000 | 79.250000 |
| 3.0 | 70 | 2 | 0 | NaN | 115.666667 | 101.400000 | 13.600000 | 0.060 | 0.100000 | 591.0 | ... | 31.833333 | 34.800000 | 22.950000 | 0.10 | 139.766667 | 0.240000 | 47.5 | 56.666667 | 83.933333 | 63.666667 |
| 4.0 | 74 | 1 | 0 | 4.800000 | 98.000000 | 101.950000 | 16.300000 | 0.380 | 2.100000 | NaN | ... | 41.725000 | NaN | 61.350000 | 0.11 | 141.050000 | 0.207500 | 72.0 | 23.000000 | 78.100000 | 84.500000 |
| 5.0 | 29 | 2 | 0 | 5.600000 | 128.000000 | 100.950000 | 14.600000 | 0.020 | 2.166667 | 258.0 | ... | 29.866667 | NaN | 3.900000 | 0.08 | 141.900000 | 0.316667 | 13.0 | 15.000000 | 121.400000 | 56.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 371.0 | 63 | 1 | 1 | 1741.500000 | 143.000000 | 95.700000 | 14.400000 | 1.510 | 0.000000 | 758.0 | ... | 30.400000 | 43.800000 | 152.000000 | NaN | 135.800000 | 0.160000 | 19.0 | 31.000000 | 88.600000 | 81.000000 |
| 372.0 | 79 | 1 | 1 | 30.716667 | 118.700000 | 119.609091 | 17.916667 | 1.635 | 0.240000 | 1833.0 | ... | 30.010000 | 47.816667 | 232.187500 | 0.06 | 153.636364 | 0.172000 | 93.0 | 82.444444 | 16.818182 | 297.363636 |
| 373.0 | 61 | 2 | 1 | 124.800000 | 100.000000 | 102.600000 | 14.900000 | 0.560 | 0.100000 | NaN | ... | 27.600000 | 36.700000 | 205.800000 | NaN | 141.600000 | 0.180000 | NaN | 9.000000 | 101.300000 | 47.000000 |
| 374.0 | 33 | 1 | 1 | 372.400000 | 119.000000 | 124.033333 | 23.250000 | NaN | 0.000000 | 2634.0 | ... | 30.150000 | 38.700000 | 109.800000 | 0.09 | 160.400000 | 0.130000 | 19.0 | 1061.000000 | 80.933333 | 109.333333 |
| 375.0 | 68 | 1 | 1 | 48.050000 | 163.000000 | 99.725000 | 14.950000 | 1.455 | 0.000000 | 1524.0 | ... | 31.550000 | 40.600000 | 162.750000 | 0.08 | 135.600000 | 0.240000 | 39.0 | 17.666667 | 76.500000 | 91.000000 |
375 rows × 77 columns
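The forward-fill and group-by step can be illustrated on a toy table. This is a minimal sketch, assuming (as the `ffill` call suggests) that the patient ID is written only on the first row of each patient's block; the values below are invented and do not come from `wuhan.xlsx`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the spreadsheet (invented values): the patient ID
# appears only on the first row of each patient's block.
toy = pd.DataFrame({
    "PATIENT_ID": [1.0, np.nan, np.nan, 2.0, np.nan],
    "age": [73, 73, 73, 61, 61],
    "hemoglobin": [130.0, np.nan, 136.0, 145.0, 147.0],
})

# Propagate each ID down to the rows that belong to it.
toy["PATIENT_ID"] = toy["PATIENT_ID"].ffill()

# One row per patient; mean() skips missing measurements by default.
per_patient = toy.groupby("PATIENT_ID").mean()
print(per_patient)
```

Because `mean()` ignores `NaN`, a patient with a single recorded value for some test keeps that value, while a patient with no recorded value for it ends up with `NaN`, as seen in the table above.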
At the beginning of our adventure we have to decide which features are the most interesting. We select the features most correlated with outcome: those whose absolute correlation with outcome is higher than 0.5.
cor = new_data.corr().abs()
cor_target = abs(cor["outcome"])
relevant_features = cor_target[cor_target>0.5]
# relevant_features.sort_values()
df = pd.DataFrame(relevant_features).sort_values('outcome').drop(["outcome"])
fig = px.bar(df.reset_index(), x='outcome', y='index',
hover_data=[], color='outcome',
color_continuous_scale=px.colors.diverging.Geyser,
title="Most correlated features")
fig.update_layout(
xaxis={
'title':'Correlation with outcome'},
yaxis={'title':'Blood attributes'})
fig.show()
From the "Most correlated features" bar chart, we can easily read off the blood features most strongly associated with the patient's death.
The strongest correlations are for the percentages of lymphocytes and neutrophils. Near the bottom of the list are the percentages of monocytes and eosinophils. However, the only absolute count on the chart is neutrophils count, which may suggest that neutrophils are the most important white cell type and that the correlations of the other white cells are incidental.
It is also worth noting that age appears on the list, which suggests that some part of society is more vulnerable.
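The selection rule itself (keep the columns whose absolute correlation with `outcome` exceeds a threshold) can be sketched on synthetic data. The column names `strong_marker`, `inverse_marker` and `noise` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
outcome = rng.integers(0, 2, size=n)

# Synthetic data: two columns driven by the outcome (one positively,
# one negatively) and one column of pure noise.
toy = pd.DataFrame({
    "outcome": outcome,
    "strong_marker": outcome * 10 + rng.normal(0, 1, size=n),
    "inverse_marker": -outcome * 10 + rng.normal(0, 1, size=n),
    "noise": rng.normal(0, 1, size=n),
})

# Absolute correlation with the target, so negative relationships count too.
cor_target = toy.corr()["outcome"].abs()
relevant = cor_target[cor_target > 0.5].drop("outcome").sort_values()
print(relevant)
```

Taking the absolute value is what lets a negatively correlated feature, such as a marker that drops in patients who died, pass the same threshold as a positively correlated one.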
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.7].sort_values()
new_data = new_data.loc[:,relevant_features.index.insert(0,'gender').insert(0,'age')]
# new_data
Parallel coordinates plot, illustrating dependency between blood features which have correlation with outcome higher than 0.7.
fig = px.parallel_coordinates(
new_data,
color="outcome",
labels = {"age":"Age",
"gender":"Gender",
"albumin":"Albumin[g/dl]",
"neutrophils(%)":"Neutrophils[%]",
"(%)lymphocyte":"Lymphocyte[%]",
"High sensitivity C-reactive protein":"High sensitivity C-reactive protein [mg/l]",
},
color_continuous_scale=px.colors.diverging.Geyser,
)
fig.update_layout(coloraxis_showscale=False)
# Show the plot
fig.show()
Thanks to the graph of "parallel coordinates", we can easily separate patients with a certain range of attribute values. By selecting a given range on the axes, only those patients which are within the range are highlighted. To reset the range, double-click the selected axis.
After reviewing the markers used for COVID detection, we decided to examine their correlation with the outcome and then check how age affects the values of these blood features.
new_data = data[["PATIENT_ID","age","gender","outcome","Lactate dehydrogenase","High sensitivity C-reactive protein","(%)lymphocyte"]]
HSC = new_data.groupby("PATIENT_ID").mean()
plt.figure(figsize=(12,10))
cor = HSC.corr()
sns.heatmap(cor,annot=True,cmap=plt.cm.Reds)
plt.show()
Based on the correlation heatmap above, we decided to compare the influence of HSC (High sensitivity C-reactive protein) on outcome with that of age and gender.
HSC['men_women'] = HSC['gender'].map({1: 'Men', 2: 'Women'})
HSC['recovered_dead'] = HSC['outcome'].map({0: 'Recovered', 1: 'Dead'})
brush = alt.selection_interval()
click = alt.selection_multi(encodings=['color'])
scale = alt.Scale(domain=['Recovered','Dead'],range=['rgb(0, 128, 128)','rgb(202, 86, 44)'])
color = alt.Color('recovered_dead:N', scale=scale,title='Outcome')
points = alt.Chart(HSC).mark_point(size=40).encode(
alt.X('age:Q',title='Age'),
alt.Y('High sensitivity C-reactive protein:Q',title="High sensitivity C-reactive protein [mg/l]"),
# size = alt.Size('men_women:N',title='Gender'),
color=alt.condition(brush, color, alt.value('lightgray')),
shape = alt.Shape('men_women:N',title="Gender"),
tooltip=['men_women:N','recovered_dead:N','age:N','(%)lymphocyte:N','Lactate dehydrogenase:N','High sensitivity C-reactive protein:N']
).add_selection(
brush
).properties(
width=1500,
).transform_filter(
click
)
bars = alt.Chart(HSC).mark_bar().encode(
x='count()',
y=alt.Y('recovered_dead:N',title='Outcome'),
color = alt.condition(click,color,alt.value('lightgrey')),
).add_selection(
click
).transform_filter(brush).properties(
width=1500,
)
alt.vconcat(
points,
bars,
data=HSC,
title="High sensitivity C-reactive protein"
)
In a healthy person the concentration is low and does not exceed 5 mg/l, but in COVID patients "HSC" rises sharply. When we select the 0-50 mg/l range across all patients, the mortality rate is only about 0.15. If we select a higher HSC range, the mortality rate increases significantly. Clicking the red or green bar isolates the dead or recovered patients, which shows that mortality increases drastically with age and with the concentration of "HSC".
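The mortality rate for a brushed range can also be computed directly instead of being read off the bar chart. The sketch below uses invented values rather than the study data, and `mortality_rate` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Invented per-patient values, with outcome coded as 0 = recovered, 1 = dead.
toy = pd.DataFrame({
    "High sensitivity C-reactive protein": [3.0, 12.5, 40.0, 48.0, 75.0, 120.0, 210.0, 160.0],
    "outcome": [0, 0, 0, 1, 0, 1, 1, 1],
})

def mortality_rate(df, column, low, high):
    """Share of patients with outcome == 1 among those with low <= column <= high."""
    selected = df[df[column].between(low, high)]
    return selected["outcome"].mean()

hsc = "High sensitivity C-reactive protein"
low_range = mortality_rate(toy, hsc, 0, 50)
high_range = mortality_rate(toy, hsc, 50, 250)
print(low_range, high_range)
```

Applied to the real `HSC` frame, the same helper would give the figure that the interactive selection shows for any chosen interval.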
brush = alt.selection_interval()
click = alt.selection_multi(encodings=['color'])
base = alt.Chart(HSC).mark_point(size=40).encode(
y=alt.Y('age:Q',title='Age'),
# size = alt.Size('men_women:N',title='Gender'),
shape = alt.Shape('men_women:N',title="Gender"),
color=alt.condition(brush, color, alt.value('lightgray')),
tooltip=['men_women:N','recovered_dead:N','age:N','(%)lymphocyte:N','Lactate dehydrogenase:N','High sensitivity C-reactive protein:N']
).add_selection(
brush
).properties(
# width=400,
# height=400
).transform_filter(
click
)
# color = alt.Color('gender:N')
bars = alt.Chart(HSC).mark_bar().encode(
x='count():Q',
y=alt.Y('recovered_dead:N',title='Outcome'),
color = alt.condition(click,color,alt.value('lightgrey')),
).add_selection(
click
).transform_filter(brush).properties(
# width=400
)
# Three scatter plots side by side, each with its own outcome bar chart underneath
(
base.encode(x='Lactate dehydrogenase').properties(title='Lactate dehydrogenase [age/(IU/l)]') & bars
) | (
base.encode(x='High sensitivity C-reactive protein').properties(title='HSC [age/(mg/l)]') & bars
) | (
base.encode(x='(%)lymphocyte').properties(title='Lymphocyte [age/(%)]') & bars
)
For the next plot we want to see how the levels of thrombocytocrit and serum sodium relate to the probability of not surviving Covid-19 at different stages of life, and how this differs between the sexes. To do this, we add a column with the stage of life.
def disc(x):
    # Map an age in years to a stage-of-life label
    if x < 25: return 'Young'
    elif x < 40: return 'Adult'
    elif x < 60: return 'Middle Age'
    else: return 'Senior'
sort_data = copy_data.sort_values(by = ['age'])
sort_data['gender'] = sort_data['gender'].map({1: 'Men', 2: 'Women'})
sort_data['outcome'] = sort_data['outcome'].map({0: 'Recovered', 1: 'Dead'})
sort_data['label_age'] = sort_data['age'].apply(lambda x: disc(x))
sort_data.label_age = pd.Categorical(sort_data.label_age, ordered=True, categories = ['Young', 'Adult', 'Middle Age', 'Senior'])
sort_data.head(3)
| PATIENT_ID | age | gender | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | ... | Activation of partial thromboplastin time | High sensitivity C-reactive protein | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine | label_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157.0 | 18 | Women | Recovered | 1.9 | 127.0 | 103.85 | 14.6 | 0.02 | 0.566667 | NaN | ... | NaN | 0.65 | 0.09 | 143.5 | 0.225 | 4.5 | 41.0 | 215.45 | 12.5 | Young |
| 213.0 | 19 | Men | Dead | 12.8 | 108.0 | 97.50 | 16.9 | 0.13 | 0.600000 | NaN | ... | NaN | 51.90 | 0.07 | 134.5 | NaN | 8.0 | 11.0 | 130.80 | 69.0 | Young |
| 102.0 | 22 | Women | Recovered | 1.9 | 138.0 | 100.80 | 15.0 | 0.03 | 0.700000 | 582.0 | ... | 43.4 | 22.60 | NaN | 140.6 | 0.200 | 16.0 | 19.0 | 127.90 | 55.5 | Young |
3 rows × 78 columns
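The same binning can be expressed with `pandas.cut`, which returns an ordered categorical in one step. This is an alternative sketch rather than the approach used above; the bin edges are copied from `disc`, and `right=False` makes each bin include its lower edge, so an age of exactly 25 falls into 'Adult' just as it does in `disc`.

```python
import numpy as np
import pandas as pd

def disc(x):
    # Same rule as in the notebook, repeated here for comparison.
    if x < 25: return 'Young'
    elif x < 40: return 'Adult'
    elif x < 60: return 'Middle Age'
    else: return 'Senior'

ages = pd.Series([18, 24, 25, 39, 40, 59, 60, 95])
labels = ['Young', 'Adult', 'Middle Age', 'Senior']

# Half-open bins [lower, upper): [-inf, 25), [25, 40), [40, 60), [60, inf)
binned = pd.cut(ages, bins=[-np.inf, 25, 40, 60, np.inf],
                right=False, labels=labels)
print(binned.tolist())
```

An ordered categorical matters for the faceted plot that follows, because it lets the facet rows appear in life-stage order rather than alphabetically.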
fig = px.scatter(sort_data,
x = 'serum sodium',
y = 'thrombocytocrit',
color = 'outcome',
color_discrete_sequence=['rgb(0, 128, 128)','rgb(202, 86, 44)'],
facet_col = 'gender',
facet_row = 'label_age',
# range_y=[0, 0.6],
labels = {
"serum sodium":"Serum sodium [mmol/l]",
"thrombocytocrit":"Thrombocytocrit [%]"
},
template="plotly_white",
title="Comparing age groups, genders with Thrombocytocrit and Serum sodium."
)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
fig.update_yaxes(title_font_size=9)
fig
In the chart above we can observe a few things.
First, there are not many patients under 40 (the Young and Adult categories). This may mean that people at these stages of life are less likely to develop severe disease and need hospitalization, or that in the early days of the pandemic (samples were collected between 2020-01-10 and 2020-02-18) there was a greater need to take care of older people.
Second, seniors with a lower thrombocytocrit (the fraction of blood volume occupied by thrombocytes) are in the high-risk group.
Third, all patients whose serum sodium is well above the norm (145 mmol/l) died. This applies to only about 30 people.
fig = px.scatter(sort_data,
x = 'calcium',
y = 'albumin',
color = 'outcome',
color_discrete_sequence=['rgb(0, 128, 128)','rgb(202, 86, 44)'],
template="plotly_white",
title="Comparing albumin with calcium")
fig.add_vline(x = 2.1, line_dash="dash", line_color="black")
fig.add_hline(y = 35, line_dash="dash", line_color="black")
fig.show()
On this plot we can observe that a large share of the patients whose albumin and calcium levels were both below the normal expected values (3.5 g/dl, drawn at 35 on the albumin axis, and 2.1 mmol/l respectively), marked by the horizontal and vertical lines, are in the high-risk group.
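The visual impression can be checked numerically by comparing the share of deaths inside and outside the lower-left quadrant. The sketch below uses invented values and the same thresholds as the dashed lines (calcium 2.1, albumin 35); it is an illustration of the check, not a result from the study data.

```python
import pandas as pd

# Invented per-patient values; the thresholds match the dashed lines in the plot.
toy = pd.DataFrame({
    "calcium": [1.9, 2.0, 2.05, 2.2, 2.3, 2.0, 2.25, 1.95],
    "albumin": [28.0, 30.0, 33.0, 38.0, 40.0, 37.0, 31.0, 34.0],
    "outcome": ["Dead", "Dead", "Recovered", "Recovered",
                "Recovered", "Recovered", "Dead", "Dead"],
})

# Patients below both thresholds, i.e. the lower-left quadrant of the scatter plot.
both_low = (toy["calcium"] < 2.1) & (toy["albumin"] < 35)

share_dead_low = (toy.loc[both_low, "outcome"] == "Dead").mean()
share_dead_rest = (toy.loc[~both_low, "outcome"] == "Dead").mean()
print(share_dead_low, share_dead_rest)
```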
Finally, we want to see how the distribution of eosinophils in the population relates to the mortality rate, and how it differs between women and men at different ages.
names = sort_data.columns
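# Fill missing values with the column median, skipping the first three
# and the last three columns
for name in names[3:-3]:
    sort_data[name] = sort_data[name].fillna(sort_data[name].median())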
sort_data
| PATIENT_ID | age | gender | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | ... | Activation of partial thromboplastin time | High sensitivity C-reactive protein | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine | label_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157.0 | 18 | Women | Recovered | 1.9 | 127.000000 | 103.850000 | 14.6000 | 0.020000 | 0.566667 | 693.5 | ... | 39.50 | 0.650000 | 0.09 | 143.500000 | 0.2250 | 4.5 | 41.000000 | 215.450000 | 12.500000 | Young |
| 213.0 | 19 | Men | Dead | 12.8 | 108.000000 | 97.500000 | 16.9000 | 0.130000 | 0.600000 | 693.5 | ... | 39.50 | 51.900000 | 0.07 | 134.500000 | 0.2150 | 8.0 | 11.000000 | 130.800000 | 69.000000 | Young |
| 102.0 | 22 | Women | Recovered | 1.9 | 138.000000 | 100.800000 | 15.0000 | 0.030000 | 0.700000 | 582.0 | ... | 43.40 | 22.600000 | 0.09 | 140.600000 | 0.2000 | 16.0 | 19.000000 | 127.900000 | 55.500000 | Young |
| 200.0 | 25 | Men | Recovered | 13.0 | 125.761905 | 101.800000 | 14.3125 | 0.100000 | 0.250000 | 693.5 | ... | 39.50 | 49.000000 | 0.09 | 139.941667 | 0.2150 | 30.0 | 25.000000 | NaN | NaN | Adult |
| 195.0 | 26 | Women | Recovered | 1.9 | 136.000000 | 98.200000 | 13.9000 | 0.020000 | 0.500000 | 447.0 | ... | 35.10 | 1.100000 | 0.15 | 138.200000 | 0.3000 | 4.0 | 16.000000 | 130.400000 | 48.000000 | Adult |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247.0 | 90 | Men | Dead | 1382.4 | 110.250000 | 105.800000 | 20.3000 | 0.183333 | 0.100000 | 693.5 | ... | 54.50 | 79.150000 | 0.07 | 138.850000 | 0.0725 | 47.0 | 12.333333 | 76.666667 | 72.666667 | Senior |
| 313.0 | 91 | Men | Dead | 15.9 | 104.500000 | 105.566667 | 15.1500 | 0.115000 | 0.000000 | 1190.0 | ... | 43.25 | 140.700000 | 0.06 | 144.566667 | 0.1600 | 60.0 | 20.000000 | 87.100000 | 54.333333 | Senior |
| 309.0 | 92 | Men | Dead | 141.6 | 119.750000 | 116.533333 | 22.4000 | 1.010000 | 0.000000 | 513.0 | ... | 48.30 | 154.633333 | 0.09 | 151.533333 | 0.1300 | 39.0 | 42.666667 | 43.675000 | 132.750000 | Senior |
| 290.0 | 94 | Women | Dead | 9.9 | 121.500000 | 97.800000 | 15.4000 | 0.485000 | 0.000000 | 693.5 | ... | 34.50 | 83.400000 | 0.09 | 137.900000 | 0.1550 | 47.0 | 12.000000 | 66.200000 | 69.000000 | Senior |
| 212.0 | 95 | Men | Dead | 280.7 | 108.000000 | 109.250000 | 17.8000 | 0.600000 | 1.600000 | 2161.0 | ... | 58.30 | 78.000000 | 0.07 | 142.100000 | 0.1900 | 80.0 | 18.000000 | 26.300000 | 184.000000 | Senior |
375 rows × 78 columns
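Median imputation on a subset of columns can also be written without a loop. This is a minimal sketch on invented values, assuming only the columns listed in `cols` should be filled; `eGFR` is deliberately left out to mirror the `NaN` values that remain in the table above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 50, 70, 80],
    "ESR": [10.0, np.nan, 30.0, 50.0],
    "eGFR": [90.0, np.nan, 60.0, 45.0],
})

# Only these columns are imputed; everything else is left untouched.
cols = ["ESR"]

# median() returns one value per column, and fillna() matches them by column name.
toy[cols] = toy[cols].fillna(toy[cols].median())
print(toy)
```

The median is less sensitive than the mean to extreme laboratory values, which is a common reason for preferring it when filling gaps in skewed measurements.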
fig = px.violin(sort_data,
x="gender",
y="eosinophils(%)",
animation_frame="label_age",
color = 'outcome',
range_y=[-1, 6],
color_discrete_sequence=['rgb(0, 128, 128)','rgb(202, 86, 44)'],
points="all",
title = 'Eosinophils (%) by gender and outcome across age groups',
template="plotly_white")
fig["layout"].pop("updatemenus")
fig.update_layout(legend_title_text=' Outcome:')
fig.show()
At the end of the section on selecting correlated features we stated a hypothesis that, of all the white cells, only the neutrophil count matters.
spy_data = sort_data.copy(deep=True)
spy_data['outcome'] = spy_data['outcome'].map({'Recovered':0, 'Dead':1})
spy_data.head(3)
| PATIENT_ID | age | gender | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride | Prothrombin time | procalcitonin | eosinophils(%) | Interleukin 2 receptor | ... | Activation of partial thromboplastin time | High sensitivity C-reactive protein | HIV antibody quantification | serum sodium | thrombocytocrit | ESR | glutamic-pyruvic transaminase | eGFR | creatinine | label_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157.0 | 18 | Women | 0 | 1.9 | 127.0 | 103.85 | 14.6 | 0.02 | 0.566667 | 693.5 | ... | 39.5 | 0.65 | 0.09 | 143.5 | 0.225 | 4.5 | 41.0 | 215.45 | 12.5 | Young |
| 213.0 | 19 | Men | 1 | 12.8 | 108.0 | 97.50 | 16.9 | 0.13 | 0.600000 | 693.5 | ... | 39.5 | 51.90 | 0.07 | 134.5 | 0.215 | 8.0 | 11.0 | 130.80 | 69.0 | Young |
| 102.0 | 22 | Women | 0 | 1.9 | 138.0 | 100.80 | 15.0 | 0.03 | 0.700000 | 582.0 | ... | 43.4 | 22.60 | 0.09 | 140.6 | 0.200 | 16.0 | 19.0 | 127.90 | 55.5 | Young |
3 rows × 78 columns
fig3 = px.parallel_coordinates(
spy_data,
color = 'outcome',
dimensions=['White blood cell count', 'lymphocyte count', 'monocytes count',
'Eosinophil count', 'basophil count(#)', 'neutrophils count'],
labels = {'White blood cell count':'White blood cell count',
'lymphocyte count':'Lymphocyte count',
'Eosinophil count':'Eosinophil count',
'monocytes count': 'Monocytes count',
'neutrophils count':'Neutrophils count',
'basophil count(#)':'Basophil count'
},
color_continuous_scale=px.colors.diverging.Geyser,
)
fig3.update_layout(coloraxis_showscale=False)
# Show the plot
fig3.show()
It is clearly visible that lymphocyte and monocytes counts have no apparent impact on outcome. The values of eosinophil and basophil counts are too small and too similar to draw any conclusion. Only for neutrophils are the values large enough, and sufficiently distinguishable between outcomes, to say that their amount is important for health.